Abstract. Studies have shown that each person is more inclined to enjoy
a group activity when 1) she is interested in the activity, and 2) many
friends with the same interest join it as well. Nevertheless, even with the
interest and social tightness information available in online social
networks, many social group activities still need to be coordinated
manually. In this paper, therefore, we first formulate a new problem,
named Participant Selection for Group Activity (PSGA), to decide the
group size and select proper participants so that the sum of personal
interests and social tightness of the participants in the group is maximized,
while the activity cost is also carefully examined. To solve the problem,
we design a new randomized algorithm, named Budget-Aware Randomized
Group Selection (BARGS), to optimally allocate the computation
budgets for effective selection of the group size and participants, and we
prove that BARGS can acquire the solution with a guaranteed performance
bound. The proposed algorithm was implemented in Facebook,
and experimental results demonstrate that social groups generated by
the proposed algorithm significantly outperform the baseline solutions.


1 Introduction 

Studies have shown that two important factors are usually involved in a
person's decision to join a social group activity: (1) interest in the activity topic
or content, and (2) social tightness with other attendees [5,8]. For example, if a
person who appreciates jazz music has complimentary tickets for a jazz concert
in Rose Theatre, she is inclined to invite her friends or friends of friends who are
also jazz enthusiasts. However, even though the information on the two factors is
now available online, the attendees of most group activities still need to be
selected manually, and the process is tedious and time-consuming, especially for a
large social activity, given the complicated social link structure and the diverse
interests of potential attendees.

Recent studies have explored community detection, graph clustering and
graph partitioning to identify groups of nodes mostly based on the graph
structure [1]. The quality of an obtained community is usually measured according
to its internal structure, together with its external connectivity to the rest of
the nodes in the graph [7]. Those approaches are not designed for activity
planning because they do not consider the interests of individual users along with
the cost of holding an activity with different numbers of participants. An event
which attracts too few or too many attendees will result in an unacceptable loss
for the planner. Therefore, it is important to incorporate the preference of each
potential participant, their social connectivity, and the activity cost during the
planning of an activity.

With this objective in mind, a new optimization problem is formulated,
named Participant Selection for Group Activity (PSGA). The problem is given
a cost function related to the group size and a social graph $G$, where each node
represents a potential attendee and is associated with an interest score that
describes the individual level of interest. Each edge has a social tightness score
corresponding to the mutual familiarity between the two persons. Since each
participant is more inclined to enjoy the activity when 1) she is interested in the
activity, and 2) many friends with the same interest join as well, the preference
of a node $v_i$ for the activity can be represented by the sum of its interest score
and the social tightness scores of the edges connecting to other participants, while
the group preference is the sum of the interest scores of all participants and the
social tightness scores of the edges connecting any two participants.
Moreover, the group utility here is represented by the group preference minus
the activity cost (e.g., the expense of food and seating), which is usually
correlated with the number of participants.¹ The objective of PSGA is to determine
the best group size and select proper participants, so that the group utility is
maximized. In addition, the induced graph of the set $F$ of selected participants
is desired to be a connected component, so that each attendee can
become acquainted with any other attendee through a social path.²

One possible approach to solving PSGA is to examine every possible
combination for every group size. However, this enumeration approach for group size $k$
requires the evaluation of $\binom{n}{k}$ candidate groups, where $n$ is the number of nodes
in $G$. Therefore, the number of group size and attendee combinations is $O(2^n)$,
and enumeration is thus not feasible in practical cases. Another approach is to
incrementally construct the group using a greedy algorithm that iteratively tries each
group size and sequentially chooses the attendee that leads to the largest
increment in group utility at each iteration. However, greedy algorithms are inclined
to be trapped in locally optimal solutions. To avoid this,
randomized algorithms have been proposed as a simple but effective
strategy to solve problems with large instances [12].

A simple randomized algorithm is to randomly choose multiple start nodes
initially. Each start node is considered as a partial solution, and a node
neighboring the partial solution is randomly chosen and added to the partial solution
at each later iteration. Nevertheless, this simple strategy has three
disadvantages. Firstly, a start node that has the potential to generate final solutions with
high group utility does not receive sufficient computational resources for
randomization in the following iterations. More specifically, each start node in the
randomized algorithm is expanded to only one final solution. Thus, a good start
node will usually fail to generate a solution with high group utility since it
has only one chance to randomly generate a final solution. The second disadvantage
is that the expansion of the partial solution does not differentiate the selection
of the neighboring nodes. Each neighboring node is treated equally and chosen
uniformly at random at each iteration. Even though this issue can be partially resolved
by assigning the selection probability of each neighboring node according to its
interest score and the social tightness of its incident edges, this assignment leads
to a greedy selection of neighbors and thus tends to be trapped in locally optimal
solutions as well. The third disadvantage is that the linear scanning of different
group sizes is not computationally tractable for real scenarios, as an online social
network contains an enormous number of nodes.

¹ Different weighted coefficients can be assigned to the group utility and activity cost
according to the corresponding scenario.

² For some group activities, it is not necessary to ensure that $F$ leads to a connected
subgraph. Those scenarios can be handled by adding a virtual node $v$ connected
to every other node in $G$: choosing $v$ in $F$ for PSGA always creates a connected
subgraph in $G \cup \{v\}$, but $F$ may not be a connected subgraph in $G$.

Keeping the above observations in mind, we propose a randomized algorithm,
called Budget-Aware Randomized Group Selection (BARGS), to effectively select
the start nodes, expand the partial solutions, and estimate the suitable group
size. The computational budget represents the target number of random
solutions. Specifically, BARGS first selects a group size limit $k_{\max}$ in accordance
with the cost function.³ Afterward, $m$ start nodes are selected, and neighboring
nodes are added to expand each partial solution iteratively until $k_{\max}$
nodes are included, and the group size corresponding to the largest group
utility is finally returned. Each start node in BARGS is expanded to multiple final
solutions according to the assigned budget. To properly invest the
computational budgets, each stage of BARGS invests more budgets in the start nodes
and group sizes that are more inclined to generate good final solutions, according
to the sampled results from the previous stages. Moreover, the node selection
probability is adaptively assigned in each stage by exploiting the cross entropy
method. In this paper, we show that our allocation of computation budgets is
the optimal strategy, and prove that the solution acquired by BARGS has a
guaranteed performance bound.

The rest of this paper is organized as follows. Section 2 formulates PSGA and
surveys related works. Section 3 explains BARGS and derives the performance
bound. User study and experimental results are presented in Section 4, and we
conclude this paper in Section 5.


³ For instance, if the largest capacity of available stadiums for a football game is
20,000, $k_{\max}$ is set as 20,000.
2 Preliminary

2.1 Problem Definition

Given a social network $G = (V, E)$, where each vertex $v_i \in V$ and each edge
$e_{i,j} \in E$ are associated with an interest score $\eta_i$ and a social tightness score
$\tau_{i,j}$, respectively, we study a new optimization problem for finding a set $F$ of
vertices which maximizes the group utility $U(F)$, i.e.,

$$U(F) = \sum_{v_i \in F} \Big( \eta_i + \sum_{v_j \in F :\, e_{i,j} \in E} \tau_{i,j} \Big) - \beta\, C(|F|), \tag{1}$$

where $F$ with $|F| \le k_{\max}$ induces a connected subgraph in $G$ to encourage each
attendee to be acquainted with another attendee through at least one social path in
$F$, $C$ is a non-negative activity cost function based on the number of attendees,
and $\beta$ is a weighted coefficient between the preference and the cost. For each node
$v_i$, let $\eta_i + \sum_{v_j \in F :\, e_{i,j} \in E} \tau_{i,j}$ denote the preference of node $v_i$ for the
social group activity. PSGA is very challenging due to the tradeoff between interest,
social tightness, and the cost function, while the constraint assuring that $F$ is connected
also complicates this problem because nodes can no longer be chosen arbitrarily
from $G$. Indeed, we show that PSGA is NP-hard.
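
As a minimal sketch of the objective, the following Python functions evaluate the group utility of a candidate set and check the connectivity constraint. The data layout is an assumption made for illustration (interest scores in a dict, tightness scores keyed by `frozenset` node pairs), and the function names are hypothetical. Each edge is counted once here, which matches the worked example later in the paper; a literal reading of the double sum in Eq. 1 would count every edge from both endpoints.

```python
from collections import deque
from itertools import combinations


def group_utility(members, interest, tightness, cost, beta):
    """Group preference minus beta * C(|F|); each edge inside F is counted once (assumption)."""
    members = set(members)
    preference = sum(interest[v] for v in members)
    for u, v in combinations(sorted(members), 2):
        preference += tightness.get(frozenset((u, v)), 0.0)
    return preference - beta * cost(len(members))


def is_connected(members, tightness):
    """Breadth-first check that the subgraph induced by `members` is connected."""
    members = set(members)
    if not members:
        return False
    start = next(iter(members))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in members - seen:
            if frozenset((u, v)) in tightness:
                seen.add(v)
                queue.append(v)
    return seen == members
```

With interest scores 0.7 and 0.6, one edge of tightness 0.6, a cost of 300 for two attendees and $\beta = 0.01$, the function returns $1.9 - 3 = -1.1$.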

Theorem 1. PSGA is NP-hard.

Proof. We prove that PSGA is NP-hard by a reduction from the DkS problem
[6]. Given a graph $G_d = (V_d, E_d)$, DkS finds a subgraph $F_d$ with $k$ nodes to
maximize the density of the subgraph. In other words, the purpose of DkS is to
maximize the number of edges $E(F_d)$ in the subgraph induced by the selected
nodes.

For each instance of DkS, we construct an instance of PSGA by letting
$G = G_d$ and $k_{\max} = \infty$, where $\eta_i$ of each node $v_i \in V$ is set as 0, $\tau_{i,j}$ of each
edge $e_{i,j} \in E$ is assigned as 1, and $\beta = 1$, $C(i) = 0$ for $i = k$ and $C(i) = \infty$ for
$i \ne k$. Therefore, PSGA will always select $k$ nodes to avoid creating a negative
objective value. We first prove the sufficient condition. For each instance of DkS
with solution node set $F_d$, we let $F = F_d$. If the number of edges $E(F_d)$ in the
subgraph of DkS is $\delta$, the preference of PSGA $W(F)$ is also $\delta$ because $F = F_d$
and the optimal group size must be $k$. We then prove the necessary condition. For
each instance of PSGA with $F$, we select the same nodes for $F_d$, and the number
of edges $E(F_d)$ must be maximized since the node number in the solution of
PSGA is $k$. The theorem follows. □

2.2 Related Works 

A recent line of study has been proposed to find cohesive subgroups in social
networks with different criteria, such as cliques, n-clubs, k-core, and k-plex.
Sarıyüce et al. [14] proposed an efficient parallel algorithm to find a k-core
subgraph, where every vertex is connected to at least k vertices in the subgraph.
Xiang et al. [16] proposed a branch-and-bound algorithm to acquire all maximal
cliques that cannot be pruned during the search tree optimization. Moreover,
finding the maximum k-plexes was comprehensively discussed in [11]. On the
other hand, community detection and graph clustering have been exploited to
identify the subgraphs with the desired structures [1]. The quality of a
community is measured according to the structure inside the community and the
structure between the community and the rest of the nodes in the graph, such
as the density of local edges, deviance from a random null model, and
conductance [7]. Nevertheless, the above models did not examine the interest score of
each user and the social tightness scores between users, which have been
regarded as crucial factors for social group activities. Moreover, the activity cost
for the group is not incorporated during the evaluation.

In addition to dense subgraphs, social groups with different characteristics
have been explored for varied practical applications. Expert team formation
in social networks has attracted extensive research interest. The problem of
constructing an expert team is to find a set of people possessing the required
skills, while the communication cost among the chosen members is minimized to
optimize the rapport among the team members and ensure efficient operation.
Communication costs can be represented by the graph diameter, the size of
the minimum spanning tree, and the total length of the shortest paths [9]. By
contrast, minimizing the total spatial distance with R-Tree from the group with
a given number of nodes to the rally point is also studied [17]. Nevertheless, this
paper focuses on a different scenario that aims at identifying a group with the
most suitable size according to the activity cost, while the selected participants
also share a common interest and high social tightness.

3 Algorithm Design for PSGA 

To solve PSGA, a baseline approach is to incrementally construct the solution
by sequentially choosing and adding the neighbor node that leads to the largest
increment in the group preference until $k_{\max}$ people are selected. Afterward, we
derive the group utility for each $k$, $1 \le k \le k_{\max}$, by incorporating the activity
cost, and extract the group size $k^*$ with the maximum group utility.
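
The baseline can be sketched in Python as follows. This is an illustrative reading of the description above rather than the authors' implementation: the adjacency layout `adj[u][v] = tightness` and the function name are assumptions, and the first node is taken to be the one with the largest interest score, as in the worked example of Example 1.

```python
def greedy_group(adj, interest, cost, beta, k_max):
    """Greedy baseline: grow one group node by node, then scan all sizes for the best utility.

    adj[u][v] holds the tightness of edge (u, v) (assumed layout).
    Returns (members of the best prefix, its group utility).
    """
    start = max(interest, key=interest.get)  # largest interest score first
    group = [start]
    preference = interest[start]
    prefix_preferences = [preference]
    while len(group) < k_max:
        members = set(group)
        best, best_gain = None, float("-inf")
        for u in group:
            for v in adj[u]:
                if v in members:
                    continue
                # Increment in group preference if v joins the current group.
                gain = interest[v] + sum(w for x, w in adj[v].items() if x in members)
                if gain > best_gain:
                    best, best_gain = v, gain
        if best is None:  # no neighbor left to add
            break
        group.append(best)
        preference += best_gain
        prefix_preferences.append(preference)
    utilities = [p - beta * cost(k) for k, p in enumerate(prefix_preferences, start=1)]
    best_k = max(range(len(utilities)), key=utilities.__getitem__) + 1
    return group[:best_k], utilities[best_k - 1]
```

Because only one sequence of partial solutions is ever built, the size scan at the end can only choose among the prefixes of that single sequence.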

Despite its simplicity, the search space of the greedy algorithm is limited,
and it thus tends to be trapped in a locally optimal solution
because only a single sequence of solutions is explored. To address the above
issues, this paper proposes a randomized algorithm, BARGS, to randomly choose
$m$ start nodes.⁴ BARGS leverages the notion of Optimal Computing Budget
Allocation (OCBA) [3] to systematically generate the solutions from each start
node, where the start nodes with more potential to generate final solutions
with large group utility are allocated more budgets (i.e., expanded to
more final solutions). In addition, since each start node can generate final
solutions with different group sizes, the sizes with larger group utility are
associated with more budgets as well (i.e., generated more times). Specifically,
BARGS includes the following two phases.

⁴ The impact of $m$ will be studied in Section 4.

1) Selection and Evaluation of Start Nodes and Group Sizes: This phase first
selects $m$ start nodes according to the summation of the interest scores and
social tightness scores of incident edges. Each start node acts as a seed to be
expanded to a few final solutions. At each iteration, a partial solution, which
consists of only a start node at the first iteration or a connected set of nodes at
any iteration afterward, is expanded by randomly selecting a node neighboring
the partial solution, until $k_{\max}$ nodes are included. The group utility of each
final solution is evaluated to optimally allocate different computational budgets
to different start nodes and different group sizes in the next phase.

2) Allocation of Computational Budgets: This phase is divided into $r$ stages,⁵
and each stage shares the same total computational budget. In the first stage,
the computational budget allocated to each start node is determined by the
group utility sampled in the first phase. In each stage afterward, the
computational budget allocated to each start node is adjusted by the sampled results in
the previous stages. Note that each node can generate different numbers of final
solutions with different group sizes. The sizes with small group utility sampled
in the previous stages will be associated with smaller computational budgets in
the current stage. Therefore, if the activity cost is a convex cost function, the
cost increases more significantly as the group size grows, and BARGS tends to
allocate smaller computational budgets and thus generate fewer final solutions
with large group sizes.

During the expansion of the partial solutions, we differentiate the probability 
to select each node neighboring to a partial solution. One intuitive way is to 
associate each neighboring node with a different probability according to the sum 
of the interest score and social tightness score on the incident edge. Nevertheless, 
this assignment is similar to the greedy algorithm as it limits the scope to only 
the local information associated with each node, making it difficult to generate 
a final solution with large group utility. By contrast, BARGS exploits the cross 
entropy method [13] according to sampled results in the previous stages in order 
to optimally assign a probability to the edge incident to a neighboring node. 
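
The expansion step described above can be sketched as follows. This is only an illustration under assumed names and data layout (`adj[u]` is an iterable of neighbors, `prob` is an optional dict of per-node selection weights); with `prob=None` it reduces to the uniform choice used when no sampled results are available yet. Every prefix of the returned order is a connected partial solution, so one run yields one sample for each size up to $k_{\max}$.

```python
import random


def expand_randomly(adj, start, k_max, rng, prob=None):
    """Grow a connected partial solution from `start` until it has k_max nodes.

    prob, if given, maps node -> selection weight (hypothetical layout);
    nodes without a positive weight fall back to a uniform choice.
    """
    order = [start]
    members = {start}
    frontier = set(adj[start]) - members
    while len(order) < k_max and frontier:
        candidates = sorted(frontier)  # sorted for reproducibility with a seeded rng
        pick = None
        if prob is not None:
            weights = [prob.get(v, 0.0) for v in candidates]
            if sum(weights) > 0:
                pick = rng.choices(candidates, weights=weights, k=1)[0]
        if pick is None:
            pick = rng.choice(candidates)
        order.append(pick)
        members.add(pick)
        frontier |= set(adj[pick])
        frontier -= members
    return order
```

Passing a seeded `random.Random` instance keeps individual runs reproducible while different seeds give different samples from the same start node.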

The detailed pseudocode is presented in Algorithm 1. In the following, we 
first present how to optimally allocate the computational budgets to different 
start nodes and different group sizes. Afterward, we exploit the cross entropy 
method to differentiate the neighbor selection during the expansion of the partial 
solutions. Finally, we derive the approximation ratio of the proposed algorithm. 


Allocation of Computational Budgets. Similar to the baseline greedy
algorithm, allocating more computational budgets to a start node $v_i$ with larger
local preference (i.e., $\eta_i + \sum_{e_{i,j} \in E} \tau_{i,j}$) examines only the local information
and thus has difficulty generating a solution with large group utility. Therefore,
to optimally allocate the computational budgets for each start node and size, we
first define the solution quality as follows.

⁵ The detailed settings of the parameters of the algorithm, such as $m$, $r$, $\alpha$, and $\beta$, are
presented in the next section.

Definition 1. The solution quality, denoted by Q, is defined as the maximum 
group utility of the solution generated from the m start nodes among all sizes. 

For each stage $t$ of phase 2 in BARGS, let $N_{i,k,t}$ denote the computational
budgets allocated to the start node $v_i$ with size $k$ in the $t$-th stage. In the
following, we first derive the optimal ratio of the computational budgets allocated
to any two start nodes $v_i$ and $v_j$ with size $k$ and $l$, respectively. Let two random
variables $Q_{i,k}$ and $Q^*_{i,k}$ denote the sampled group utility of any solution and
the maximal sampled group utility of a solution for start node $v_i$ with size $k$,
respectively. If the activity cost is not considered, according to the central limit
theorem, $Q_{i,k}$ follows the normal distribution when $N_{i,k}$ is large, and it can be
approximated by the uniform distribution in $[c_{i,k}, d_{i,k}]$ as analyzed in OCBA [3],
where $c_{i,k}$ and $d_{i,k}$ denote the minimum and maximum sampled group utility in
the previous stages, respectively. On the other hand, when the activity cost is
considered, the cumulative distribution function is shifted by $\beta C(k)$, and it still
follows the same distribution. Therefore, we have the following lemma.

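Lemma 1. The probability that the solution generated from the start node $v_i$
with size $k$ is better than the solution generated from the start node $v_j$ with size
$l$, i.e., $P(Q^*_{i,k} \ge Q^*_{j,l})$, satisfies the following.

$$P(Q^*_{i,k} \le Q^*_{j,l}) \le
\begin{cases}
0 & \text{if } d_{j,l} \le c_{i,k}, \\
\dfrac{1}{2} \left( \dfrac{d_{j,l} - c_{i,k}}{d_{i,k} - c_{i,k}} \right)^{N_{i,k}} & \text{if } c_{i,k} < d_{j,l} \le d_{i,k}, \\
1 & \text{if } d_{i,k} \le c_{j,l}.
\end{cases} \tag{2}$$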


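Proof. The cumulative distribution function of $Q_{i,k}$ is

$$p_{Q_{i,k}}(x) =
\begin{cases}
0 & \text{if } x < c_{i,k}, \\
\dfrac{x - c_{i,k}}{d_{i,k} - c_{i,k}} & \text{if } c_{i,k} \le x \le d_{i,k}, \\
1 & \text{otherwise.}
\end{cases}$$

After incorporating the operation cost function $C(|F|)$ with $|F| = k$, the
cumulative distribution function of $Q_{i,k}$ is

$$p_{Q_{i,k}}(x) =
\begin{cases}
0 & \text{if } x < c_{i,k} - \beta C(k), \\
\dfrac{x - c_{i,k} + \beta C(k)}{d_{i,k} - c_{i,k}} & \text{if } c_{i,k} - \beta C(k) \le x \le d_{i,k} - \beta C(k), \\
1 & \text{otherwise.}
\end{cases} \tag{3}$$

Therefore, for the maximal value $Q^*_{i,k}$,

$$p_{Q^*_{i,k}}(x) = p_{Q_{i,k}}(x)^{N_{i,k}}.$$

From Eq. 3, the cumulative distribution function is shifted by $\beta C(k)$ when we
incorporate the operation cost, and it thus still follows the same distribution.
The probability that the solution generated from the start
node $v_i$ with size $k$ is better than the solution generated from the start node $v_j$
with size $l$, i.e., $P(Q^*_{i,k} \ge Q^*_{j,l})$, can then be derived according to [15] as follows.

$$P(Q^*_{i,k} \le Q^*_{j,l}) \le
\begin{cases}
0 & \text{if } d_{j,l} \le c_{i,k}, \\
\dfrac{1}{2} \left( \dfrac{d_{j,l} - c_{i,k}}{d_{i,k} - c_{i,k}} \right)^{N_{i,k}} & \text{if } c_{i,k} < d_{j,l} \le d_{i,k}, \\
1 & \text{if } d_{i,k} \le c_{j,l}.
\end{cases}$$

The lemma follows. □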
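
The two distribution facts used in the proof, the uniform approximation shifted by a fixed cost term and the distribution of the maximum of several independent samples, can be checked numerically with a short Python sketch. The function names and the `shift` parameter (standing for $\beta C(k)$) are assumptions introduced for illustration.

```python
def uniform_cdf(x, c, d, shift=0.0):
    """CDF at x of a utility that is uniform on [c, d] before a fixed cost `shift` is subtracted."""
    lo, hi = c - shift, d - shift
    if x < lo:
        return 0.0
    if x >= hi:  # also covers the degenerate case c == d
        return 1.0
    return (x - lo) / (hi - lo)


def max_cdf(x, c, d, n, shift=0.0):
    """CDF at x of the maximum of n independent samples: the single-sample CDF raised to n."""
    return uniform_cdf(x, c, d, shift) ** n
```

For example, with $c = 0$, $d = 1$ and three samples, the probability that the best sample is at most 0.5 is $0.5^3 = 0.125$, which shows how quickly additional budget concentrates the maximum near $d$.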

Let $v_b$ and $k^*_b$ denote the best start node and the best activity size for $v_b$,
respectively. With Lemma 1, BARGS in each stage allocates the computational
budgets to different start nodes as follows.

$$\frac{N_{i,t}}{N_{j,t}} = \frac{P(Q = Q^*_i)}{P(Q = Q^*_j)}, \tag{4}$$

where $P(Q = Q^*_i) = \sum_k P(Q^*_{i,k} \ge Q^*_{b,k^*_b})$. This ratio of the computational
budget allocation is optimal in OCBA [3], which implies that any other allocation
generates a smaller $Q$. Note that if the allocated computational budget for a
start node is 0 in the $t$-th stage, we prune off the start node in every stage
afterward. After deriving the computational budget $N_{i,t}$ for each start node $v_i$,
we distribute the budgets to the solutions with different group sizes. Let $N_{i,k,t}$
denote the number of solutions with group size $k$ from the start node $v_i$.

$$N_{i,k,t} = N_{i,t} \cdot \frac{P(Q^*_{i,k} \ge Q^*_{b,k^*_b})}{\sum_k P(Q^*_{i,k} \ge Q^*_{b,k^*_b})}. \tag{5}$$
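
Both allocation steps split an integer budget in proportion to a list of non-negative scores (per start node, then per group size). A minimal Python sketch of such a proportional split is given below. The rounding rule (largest remainder) is an assumption, since the text only reports approximate values, and the function name is hypothetical.

```python
def allocate_budget(total, scores):
    """Split an integer budget proportionally to non-negative scores.

    Fractional shares are rounded with the largest-remainder rule (an assumption),
    so the returned integers always sum to `total` when any score is positive.
    """
    score_sum = sum(scores)
    if total <= 0 or score_sum <= 0:
        return [0] * len(scores)
    shares = [total * s / score_sum for s in scores]
    alloc = [int(share) for share in shares]  # floor, since shares are non-negative
    leftover = total - sum(alloc)
    by_remainder = sorted(range(len(scores)),
                          key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc
```

With a stage budget of 10 and scores 1.39 and 0.32, the split is 8 and 2.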


Size   Cost
1      400
2      300
3      200
4      350
5      500
6      650

k    c_{2,k}   d_{2,k}   c_{6,k}   d_{6,k}
1    -3.4      -3.4      -3.4      -3.4
2    -2.3      -1.1      -1.6      -1.2
3    -1.3       1.6      -0.4       0.8
4    -1.5       1.1      -1.2       0.6

Fig. 1. Illustrative example of BARGS

It is worth noting that when we generate a solution with size $k$, the solutions
from size 1 to size $k - 1$ are generated as well. Therefore, to avoid generating
an excess number of solutions with small group sizes, it is necessary to reallocate
the computation budgets. Let $\hat{N}_{i,k,t}$ denote the reallocated budget of start node
$v_i$ with size $k$ in the $t$-th stage. BARGS reallocates the computational budgets
starting from size $k_{\max} - 1$ as follows.

$$\hat{N}_{i,k,t} = \max\Big(0,\; N_{i,k,t} - \sum_{l > k} \hat{N}_{i,l,t}\Big). \tag{6}$$

Specifically, after deriving $N_{i,k,t}$ with Eq. 5, BARGS derives $\hat{N}_{i,k,t}$ from $k = k_{\max}$
to 1. Initially, $\hat{N}_{i,k_{\max},t} = N_{i,k_{\max},t}$. Afterward, for $k = k_{\max} - 1$,
if $N_{i,k_{\max}-1,t}$ is equal to $N_{i,k_{\max},t}$, it is not necessary to generate
additional solutions with size $k_{\max} - 1$ since they have been created during the
generation of the solutions with size $k_{\max}$. In this case, $\hat{N}_{i,k_{\max}-1,t}$ is 0.
Otherwise, BARGS sets $\hat{N}_{i,k_{\max}-1,t} = N_{i,k_{\max}-1,t} - N_{i,k_{\max},t}$.
The above process repeats until $k = 1$. Since the number
of solutions with size $k$ is still $N_{i,k,t}$, the computational budget allocation is still
optimal as shown in Eq. 4.
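
The reallocation can be sketched in a few lines of Python. This follows one reading of the description, in which the budget of size $k$ is reduced by the number of runs already scheduled for all larger sizes and clipped at zero; the list layout (`budgets[k - 1]` holds the budget of size $k$) and the function name are assumptions.

```python
def reallocate(budgets):
    """Turn per-size targets N_k into extra runs N_hat_k, scanning from the largest size down.

    A run of size l also yields one solution of every smaller size, so size k only
    needs max(0, N_k - runs already scheduled for sizes > k) additional runs.
    """
    extra = [0] * len(budgets)
    scheduled = 0  # total runs scheduled so far for larger sizes
    for k in range(len(budgets) - 1, -1, -1):
        extra[k] = max(0, budgets[k] - scheduled)
        scheduled += extra[k]
    return extra
```

For targets of 3, 5, 5 and 2 solutions for sizes 1 to 4, only 3 runs of size 3 and 2 runs of size 4 are scheduled, and the 5 prefixes they produce already cover sizes 1 and 2.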

Neighboring Node Differentiation. To effectively differentiate neighbor
selection, BARGS takes advantage of the cross entropy method [13] to achieve
importance sampling by adaptively assigning a different probability to each
neighboring node from the sampled results in previous stages. Take start node
$v_i$ with size $k$ as an example. After collecting samples $X_{i,k,1}, X_{i,k,2}, \ldots,
X_{i,k,q}, \ldots, X_{i,k,N_{i,k,t}}$ generated from start node $v_i$, BARGS calculates the total
group utility $U(X_{i,k,q})$ for each sample and sorts them in descending order,
$U_{(1)} \ge \cdots \ge U_{(N_{i,k,t})}$. Let $\gamma_{i,k,t}$ denote the group utility of the top-$\rho$
performance sample, i.e., $\gamma_{i,k,t} = U_{(\lceil \rho N_{i,k,t} \rceil)}$. With those sampled results, we set the
selection probability $p_{i,k,t+1,j}$ of every node $v_j$ in stage $t + 1$ from the partial
solution expanded from node $v_i$ by fitting the distribution of top-$\rho$ performance
samples as follows.

Definition 2. A Bernoulli sample vector, denoted as $X_{i,k,q} = (x_{i,k,q,1},
\ldots, x_{i,k,q,n})$, is defined to be the $q$-th sample vector from start node $v_i$, where
$x_{i,k,q,j}$ is 1 if node $v_j$ is selected in the $q$-th sample and 0 otherwise.

$$p_{i,k,t+1,j} = \frac{\sum_{q=1}^{N_{i,k,t}} I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}}\, x_{i,k,q,j}}{\sum_{q=1}^{N_{i,k,t}} I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}}}, \tag{7}$$

where $I_{\{U(X_{i,k,q}) \ge \gamma_{i,k,t}\}}$ is 1 if the group utility of sample $X_{i,k,q}$ reaches the
threshold $\gamma_{i,k,t} \in \mathbb{R}$, and 0 otherwise. Intuitively, the neighbor that tends to generate a
better solution will be assigned a higher selection probability. As shown in [13],
the above probability assignment scheme has been proved to be optimal from the
perspective of cross entropy. Eq. 7 minimizes the Kullback-Leibler cross entropy
(KL) distance between the node selection probability $p_{i,k,t+1}$ and the distribution
of top-$\rho$ performance samples, such that the performance of random samples in
the $(t + 1)$-th stage is guaranteed to be closest to the top-$\rho$ performance samples in
the $t$-th stage.
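
The probability update can be sketched in Python as follows: rank the samples by group utility, keep those at or above the utility of the top-$\rho$ sample, and set each node's probability to the fraction of kept samples that contain it. The function name and the list-of-0/1-vectors layout are assumptions, and a small tolerance is subtracted before taking the ceiling to guard against floating-point error in $\rho N$.

```python
import math


def ce_update(samples, utilities, rho):
    """Cross-entropy style update of node selection probabilities.

    samples: list of 0/1 membership vectors, one per sampled solution.
    utilities: group utility of each sample, in the same order.
    rho: fraction of top-performing samples used to refit the probabilities.
    """
    n = len(samples)
    top = max(1, math.ceil(rho * n - 1e-9))          # index of the top-rho sample
    gamma = sorted(utilities, reverse=True)[top - 1]  # elite threshold
    elite = [x for x, u in zip(samples, utilities) if u >= gamma]
    return [sum(x[j] for x in elite) / len(elite) for j in range(len(samples[0]))]
```

With five samples and $\rho = 0.6$, the three best samples are kept; a node that appears in two of them receives probability 2/3.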

Example 1. Figure 1 presents an illustrative example with a social network of
size 6. For the greedy algorithm, $v_5$ is first selected since its interest score is the
maximum among all nodes, i.e., 0.7. Afterward, node $v_6$ is extracted with total
preference of $0.7 + 0.5 + 0.6 = 1.8$. $v_4$, instead of $v_2$ or $v_3$, is chosen because it
generates the largest increment of preference, i.e., 0.7, and leads to a group with
total preference of 2.5. After $v_1$ is further selected with the increment of 1.1, $v_2$ is
selected with total preference of 4.9. Finally, $v_3$ is selected with total preference of
5.6. Assume that the weighting $\beta$ between preference and cost function is 0.01.⁶
The greedy algorithm scans each size to obtain the best size, i.e., calculating the
maximum among $0.7 - 0.01 \cdot 400$, $1.8 - 0.01 \cdot 300$, $2.5 - 0.01 \cdot 200$, $3.6 - 0.01 \cdot 350$,
$4.9 - 0.01 \cdot 500$, and $5.6 - 0.01 \cdot 650$, and obtains the best size 3 with group
utility of 0.5. In this simple example, the above algorithm is not able to find the
optimal solution since it facilitates the selection of nodes only suitable at the
corresponding iterations.
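
The size scan in this example is simple arithmetic and can be reproduced with the following Python sketch, which uses only the cumulative preferences and the costs quoted in the example; the function name is hypothetical.

```python
def best_size_by_scan(preferences, costs, beta):
    """Return (best size, utility) given cumulative preferences and costs for sizes 1..n."""
    utilities = [p - beta * c for p, c in zip(preferences, costs)]
    best_k = max(range(1, len(utilities) + 1), key=lambda k: utilities[k - 1])
    return best_k, utilities[best_k - 1]


# Cumulative greedy preferences for sizes 1..6 and the cost table of Figure 1.
preferences = [0.7, 1.8, 2.5, 3.6, 4.9, 5.6]
costs = [400, 300, 200, 350, 500, 650]
size, utility = best_size_by_scan(preferences, costs, beta=0.01)
```

The six utilities are $-3.3$, $-1.2$, $0.5$, $0.1$, $-0.1$ and $-0.9$, so the scan selects size 3 with group utility 0.5.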

We also take Figure 1 as an illustrative example for BARGS with $k_{\max} = 4$.
Phase 1 first chooses $\lceil n/k_{\max} \rceil = 2$ start nodes by summing up the topic
interest score and the social tightness scores for every node. Therefore, $v_2$ with
$0.6 + 0.7 + 0.6 + 0.9 - 0.6 = 2.2$ and $v_6$ with $0.6 + 0.5 + 0.7 = 1.8$ are selected. Next,
let $T = 20$, $P_b = 0.7$ and $\alpha = 0.9$ in this example, and the number of stages is
thus $r = 2$. Each start node generates 5 samples in the
first stage. The intermediate solution obtained so far is denoted as $V_S$, and the
set of candidate attendees extracted so far is denoted as $V_A$. Therefore, by selecting $v_2$
as a start node, the total group utility of $V_S = \{v_2\}$ is $0.6 - 0.01 \cdot 400 = -3.4$,
and $V_A = \{v_1, v_3, v_4, v_5\}$. Since the node selection probability is homogeneous
in the first stage, we randomly select $v_1$ from $V_A$ to expand $V_S$. Now the total
group utility of $V_S = \{v_1, v_2\}$ is $U(V_S) = 0.6 + 0.7 + 0.6 - 0.01 \cdot 300 = -1.1$, and
$V_A = \{v_3, v_4, v_5\}$. The process of expanding $V_S$ continues until the cardinality
of $V_S$ reaches $k_{\max} = 4$, e.g., $v_5$ and then $v_3$. Afterward, we record the first
sample result $X_{2,2,1} = (1, 1, 0, 0, 0, 0)$ with the total group utility of $-1.1$, the
worst result of $v_2$ with size 2 ($c_{2,2} = -1.1$), and the best result of $v_2$ with
size 2 ($d_{2,2} = -1.1$). Similarly, $X_{2,3,1} = (1, 1, 0, 0, 1, 0)$ with the total group
utility of $-1$ and $X_{2,4,1} = (1, 1, 1, 0, 1, 0)$ with the total group utility of 0.7. The
second sampled results from start node $v_2$ are $\{v_2, v_3, v_4, v_1\}$. Therefore, $X_{2,2,2} =
(0, 1, 1, 0, 0, 0)$ with the total group utility of $-1.4$, $X_{2,3,2} = (0, 1, 1, 1, 0, 0)$ with
the total group utility of 0.8, and $X_{2,4,2} = (1, 1, 1, 1, 0, 0)$ with the total group utility
of 1.2. Afterward, the worst and the best results of $v_2$ are updated to $c_{2,2} = -1.4$,
$d_{2,2} = -1.1$, $c_{2,3} = -1$, $d_{2,3} = 0.8$, $c_{2,4} = 0.7$, and $d_{2,4} = 1.2$. After drawing 3
more samples from node $v_2$, we repeat the above process for start node $v_6$ with
5 samples. The results are summarized on the right of Figure 1.


To allocate the computational budgets for the second stage, i.e., $t = 2$, we
first find the allocation ratio $N_{2,2} : N_{6,2} = 1.39 : 0.32$. Therefore, the
allocated computational budgets for start nodes $v_2$ and $v_6$ are
$10 \cdot \frac{1.39}{1.39 + 0.32} \approx 8$ and $10 \cdot \frac{0.32}{1.39 + 0.32} \approx 2$, respectively.
$N_{2,2,2}$, $N_{2,3,2}$, and $N_{2,4,2}$ are approximately 0, 6, and 2, respectively.
BARGS reallocates the computational budgets by
$\hat{N}_{2,3,2} = N_{2,3,2} - N_{2,4,2} = 6$. Afterward, we update the node selection probability.
Take the node selection probability for start node $v_2$ with size 3 in the second
stage for node $v_1$ as an example, i.e., $p_{2,3,2,1}$. Given $\rho = 0.6$, i.e., BARGS selects
top-3 performance samples, if $v_1$ is selected 2 times in top-3 performance samples,
$p_{2,3,2,1}$ is set as $\frac{2}{3}$. The process for $v_6$ is similar and thus omitted here due to the
space constraint. After the second stage, the optimal solution is $\{v_1, v_2, v_4\}$ with
maximum group utility of 1.6, which is better than the group utility generated
by the greedy algorithm, i.e., 0.5.

⁶ The parameter setting of $\alpha$ will be introduced in more detail in the next section.


Theoretical Results. The following theorem first analyzes the probability
$P(Q = Q^*_{b,k^*_b})$ that $v_b$, as decided according to the samples in the previous
stages, is actually the start node that generates the maximal group utility with
optimal size $k^*_b$. Let $\alpha$ denote the closeness ratio between the maximum of the
start node with the maximal group utility and the maximum of other start nodes
or of different sizes, i.e., $\alpha = (d_{a,k^*_b} - c_{b,k^*_b}) / (d_{b,k^*_b} - c_{b,k^*_b})$, where $v_a$ generates
the maximal group utility among other start nodes. Therefore, in addition to 0
and 1, $\alpha$ is allowed to be any other value from 0 to 1.


Theorem 2. For PSGA with parameter $(m, T, k_{\max})$, where $m$ is the number
of start nodes, $T$ is the total computational budgets, and $k_{\max}$ is the group size
limit, the probability $P(Q = Q^*_{b,k^*_b})$ that $v_b$ selected according to the previous
stages is actually the start node that generates the optimal solution with optimal
size $k^*_b$ is at least $1 - \frac{1}{2}(k_{\max} + m - 2)\,\alpha^{\frac{T}{r m k_{\max}}}$.


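Proof. According to the Bonferroni inequality, $p\{\bigcap_{i=1}^{m} (Y_i < 0)\} \ge 1 - \sum_{i=1}^{m}
[1 - p(Y_i < 0)]$. In our case, $Y_i$ is replaced by $Q^*_{i,k} - Q^*_{b,k^*_b}$ to acquire a lower bound
for the probability that $v_b$ enjoys the maximal group utility with optimal size
$k^*_b$. Therefore, by using Equation 2,

$$\begin{aligned}
P(Q = Q^*_{b,k^*_b})
&= p\Big\{\bigcap_{l=1,\, l \ne k^*_b}^{k_{\max}} (Q^*_{b,l} - Q^*_{b,k^*_b} < 0)\Big\} \cdot
   p\Big\{\bigcap_{i=1,\, i \ne b}^{m} (Q^*_{i,k^*_i} - Q^*_{b,k^*_b} < 0)\Big\} \\
&\ge \Big(1 - \sum_{l=1,\, l \ne k^*_b}^{k_{\max}} \big[1 - p(Q^*_{b,l} - Q^*_{b,k^*_b} < 0)\big]\Big)
     \Big(1 - \sum_{i=1,\, i \ne b}^{m} \big[1 - p(Q^*_{i,k^*_i} - Q^*_{b,k^*_b} < 0)\big]\Big) \\
&\ge \Big(1 - \frac{1}{2} \sum_{l=1,\, l \ne k^*_b}^{k_{\max}} \Big(\frac{d_{b,l} - c_{b,k^*_b}}{d_{b,k^*_b} - c_{b,k^*_b}}\Big)^{N_{b,k^*_b}}\Big)
     \Big(1 - \frac{1}{2} \sum_{i=1,\, i \ne b}^{m} \Big(\frac{d_{i,k^*_i} - c_{b,k^*_b}}{d_{b,k^*_b} - c_{b,k^*_b}}\Big)^{N_{b,k^*_b}}\Big).
\end{aligned}$$

By introducing $\alpha$, $P(Q = Q^*_{b,k^*_b})$ is greater than

$$\begin{aligned}
&\Big(1 - \frac{1}{2}(k_{\max} - 1)\,\alpha^{N_{b,k^*_b}}\Big)\Big(1 - \frac{1}{2}(m - 1)\,\alpha^{N_{b,k^*_b}}\Big) \\
&\ge 1 - \frac{1}{2}(k_{\max} + m - 2)\,\alpha^{N_{b,k^*_b}} \\
&\ge 1 - \frac{1}{2}(k_{\max} + m - 2)\,\alpha^{\frac{T}{r m k_{\max}}}.
\end{aligned}$$

The theorem follows. □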


Given the total budgets T and a general cost function, i.e., without any 
assumption, the following theorem derives a lower bound of the solution obtained 
by BARGS. 


Theorem 3. The maximum group utility E[Q] from the solution of BARGS is 
at least ( j>r ^ • Q*, where Nh kr after r stapes is T, 




Q* is the optimal solution for a PSGA problem in r-stage computational budget 
allocation, and k^ is the optimal group size of the best node Vb. 


Proof. It is challenging to derive the performance bound without any assumption
on the cost function because (1) no useful properties such as monotonicity,
submodularity, and convexity are available, so it is impossible to estimate the
performance according to the size, and (2) the cost function can dominate the
performance bound or be neglected depending on $\beta$. However, the cumulative
distribution function of $Q^*_{i,k}$ follows the Gaussian distribution regardless of $i$ and $k$.
Therefore, we analyze the performance bound by regarding each combination of
$(i, k)$ as a sampling result of different start nodes.

Notice that, given a fixed size k, the maximum preference from the solu¬ 
tion of BARGS from the best node Vb without the cost function is at least 
-^b( ^ ■ <3*: where Nb after r stages is T, and Q* is the opti¬ 

mal solution for a PSGA problem without cost function in r-stage computational 
budget allocation. Therefore, 


E[Q]>Nb,kT{ 


N, 


b.kT 


A 


l+N, 


b.k- 


Q* 


( 8 ) 


If the computational budget allocation is r—stages with T > fema^mr 

Nb fc- is f -b which is The theorem follows. □ 

rmkmax 2 2r ’ ^rmkmax 


Time Complexity of BARGS. The time complexity of BARGS contains
two parts. The first phase selects $m$ start nodes in $O(E + n + m \log n)$ time,
where $O(E)$ is to sum up the interest and social tightness scores, and
$O(n + m \log n)$ is to build a heap and extract the $m$ nodes with the largest
sum. Afterward, the second phase of BARGS includes $r$ stages, and each stage
allocates the computational resources in $O(m)$ time and generates $O(T/r)$ new
partial solutions with at most $k_{\max}$ nodes for all start nodes. Therefore,
the time complexity of the second phase is
$O(r(m + \frac{T}{r} k_{\max})) = O(k_{\max} T)$, and BARGS therefore needs
$O(E + m \log n + k_{\max} T)$ running time.
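
The first phase described above can be sketched as follows. This is only an
illustration under stated assumptions: each node's score is taken to be its
interest score plus the social tightness scores of its incident edges, and the
function name and input layout (`interest` as a node-to-score mapping,
`tightness` as an edge-to-score mapping) are hypothetical.

```python
import heapq


def select_start_nodes(interest, tightness, m):
    """Return the m nodes with the largest summed score.

    interest:  dict mapping node -> interest score
    tightness: dict mapping (u, v) -> social tightness of an undirected edge
    Assumption: a node's score is its interest plus its incident tightness.
    """
    total = dict(interest)                    # O(n)
    for (u, v), score in tightness.items():   # O(E): sum up the scores
        total[u] = total.get(u, 0.0) + score
        total[v] = total.get(v, 0.0) + score

    # Negate the scores because heapq implements a min-heap.
    heap = [(-score, node) for node, score in total.items()]
    heapq.heapify(heap)                       # O(n): build the heap
    count = min(m, len(heap))
    return [heapq.heappop(heap)[1] for _ in range(count)]  # O(m log n)


if __name__ == "__main__":
    interest = {1: 0.5, 2: 0.1, 3: 0.9, 4: 0.2}
    tightness = {(1, 2): 0.4, (2, 3): 0.3, (3, 4): 0.1}
    print(select_start_nodes(interest, tightness, 2))
```

Building the heap once and popping $m$ times mirrors the $O(n + m \log n)$ term
in the analysis; `heapq.nlargest` would also work but has a different cost
profile.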

4 Experimental Results 

4.1 Experiment Setup 

We implement BARGS in Facebook and invite 50 people from various communities,
e.g., schools, government, technology companies, and businesses, to join
our user study. We compare the solution quality and running time of manual
coordination and BARGS for answering PSGA problems, to evaluate the need for
an automatic group recommendation service. Each user is asked to plan 5 social
activities with the social graphs extracted from their social networks in
Facebook. The interest scores follow the power-law distribution with an exponent
of 2.5 according to the recent analysis [4] on real datasets. The social tightness
score between two friends is derived according to the number of common friends,
which represents the proximity interaction [2], and the probability of negative
weights [10]. Then, the weighted coefficient $\lambda$ on social tightness scores
and interest scores and the weighted coefficient $\beta$ on group preference and
activity cost in Footnote 4 are set as the average values specified by the 50
people, i.e., $\lambda = 0.527$ and $\beta = 0.514$. Most importantly, after the
scores are returned by the above renowned models, each user is allowed to
fine-tune the two scores. In addition to the user study, three real datasets are
evaluated in the experiment. The first dataset is crawled from Facebook with
90,269 users in the New Orleans network^10. The second dataset is crawled from
DBLP with 511,163 nodes and 1,871,070 edges. The third dataset, Flickr^11, with
1,846,198 nodes and 22,613,981 edges, is also incorporated to demonstrate the
scalability of the proposed algorithms.
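
As an illustration of the interest score model, the sketch below draws values
from a continuous power-law distribution with exponent 2.5 by inverse-transform
sampling. The text only states the exponent; the lower cutoff `x_min`, the
fixed seed, and the function name are assumptions made for this example.

```python
import random


def sample_power_law(n, exponent=2.5, x_min=1.0, seed=0):
    """Draw n samples with density proportional to x ** (-exponent), x >= x_min.

    Inverse-transform sampling: if U is uniform on [0, 1), then
    x_min * (1 - U) ** (-1 / (exponent - 1)) follows the power law.
    x_min and seed are illustrative assumptions.
    """
    if exponent <= 1.0:
        raise ValueError("exponent must exceed 1 for a normalizable density")
    rng = random.Random(seed)
    power = -1.0 / (exponent - 1.0)
    return [x_min * (1.0 - rng.random()) ** power for _ in range(n)]


if __name__ == "__main__":
    scores = sample_power_law(10)
    print([round(s, 3) for s in scores])
```

The raw samples are unbounded above, so a deployment would likely rescale or
truncate them before use as scores; that step is not specified in the text.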

In this paper, the activity cost is modelled by a piecewise linear function,
which can approximate any non-decreasing function. We set the activity cost
according to the auditorium cost and other related cost in Duke Energy Center^12.

C(k) =  400 - k    if 0 < k < 100,
        850 - k    if 100 < k < 600,
        2200 - k   if 600 < k < 1750.

^10 http://socialnetworks.mpi-sws.org/data-wosn2009.html
^11 http://socialnetworks.mpi-sws.org/data-imc2007.html
^12 http://www.dukeenergycenterraleigh.com/uploads/venues/rental/5-rateschedule.pdf

Fig. 2. Results of user study

We compare deterministic greedy (DGreedy), randomized greedy (RGreedy),
and BARGS in an HP DL580 server with four Intel E7-4870 2.4 GHz CPUs and
128 GB RAM. RGreedy first chooses the same m start nodes as BARGS. At each
iteration, RGreedy calculates the preference increment of adding a neighboring
node $v_j$ to the intermediate solution $V_S$ obtained so far for each
neighboring node, and sums them up as the total preference increment. Afterward,
RGreedy sets the node selection probability of each neighbor as the ratio of the
corresponding preference increment to the total preference increment, similar to
the concept in the greedy algorithm. Notice that the computation budgets
represent the number of generated solutions. With more computation budgets,
RGreedy generates more solutions of group size kmax, examines the group utility
by subtracting the activity cost from group size 1 to kmax, and selects the group
with maximum group utility. It is worth noting that RGreedy is computationally
intensive and not scalable to support a large group size because it is necessary
to sum up the interest scores and social tightness scores during the selection of
a node neighboring to each partial solution. Therefore, we can only present the
results of RGreedy with small group sizes.
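
The selection rule of RGreedy described above can be sketched as follows. The
function names and the dictionary input are hypothetical, and the sketch
assumes non-negative preference increments; how RGreedy treats negative
increments is not stated in the text.

```python
import random


def selection_probabilities(increments):
    """Map each neighbor to increment / total increment.

    increments: dict mapping neighbor -> preference increment obtained by
    adding that neighbor to the intermediate solution (assumed >= 0).
    """
    if not increments:
        raise ValueError("no neighboring node to choose from")
    if any(value < 0 for value in increments.values()):
        raise ValueError("this sketch assumes non-negative increments")
    total = sum(increments.values())
    if total == 0:
        # Assumption: fall back to a uniform choice when nothing adds value.
        return {node: 1.0 / len(increments) for node in increments}
    return {node: value / total for node, value in increments.items()}


def pick_neighbor(increments, rng=random):
    """Randomly pick one neighbor according to the probabilities above."""
    probs = selection_probabilities(increments)
    nodes = list(probs)
    weights = [probs[node] for node in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


if __name__ == "__main__":
    print(selection_probabilities({"a": 1.0, "b": 3.0}))
    print(pick_neighbor({"a": 1.0, "b": 3.0}, random.Random(1)))
```

Recomputing every neighbor's increment at each step is what makes this rule
expensive for large groups, in line with the scalability remark above.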

The default m in the experiment is set as n/kmax since n/kmax groups can 
be acquired from a network with n nodes if each group has kmax participants. 
The default cross-entropy parameters p and a are set as 0.3 and 0.99 as
recommended by the cross-entropy method [13]. Since BARGS natively supports
parallelization, we also implemented it with OpenMP to demonstrate the gain
from parallelization with more CPU cores.


4.2 User Study 

Figures 2(a)-(c) compare manual coordination and BARGS in the user study. In 
addition, the optimal solution is also derived with the enumeration method since 
the network size is very small. Figures 2(a) and (b) present the solution quality 
and execution time with different network sizes. The result indicates that the 
solutions obtained by BARGS are identical to the optimal solutions, but users 
are not able to acquire the optimal solutions even when n = 5. As n increases, 
the solution quality of manual coordination degrades rapidly. We also compare 
the accuracy of selecting the optimal group size in Figure 2(c). As n increases, 
it becomes more difficult for a user to correctly identify the optimal size, while 
BARGS can always select the optimal one. Therefore, it is desirable to deploy 
BARGS as an automatic group recommendation service, especially to address 
the need of a large group in a massive social network nowadays. 
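
For very small networks, the enumeration method mentioned above can be sketched
as a brute-force search over all groups of size 1 to kmax. The utility function
is passed in as a callable because its exact form is defined elsewhere in the
paper; `toy_utility`, its scores, and its per-person cost are hypothetical, and
this sketch does not enforce any connectivity requirement on the group.

```python
from itertools import combinations


def enumerate_best_group(nodes, utility, k_max):
    """Try every group of size 1..k_max and keep the one with maximum utility."""
    best_group, best_value = None, float("-inf")
    for k in range(1, min(k_max, len(nodes)) + 1):
        for group in combinations(nodes, k):
            value = utility(group)
            if value > best_value:
                best_group, best_value = group, value
    return best_group, best_value


# Hypothetical toy instance, used only to exercise the search.
INTEREST = {"a": 1.0, "b": 0.2, "c": 0.8}
TIGHTNESS = {("a", "c"): 0.5}


def toy_utility(group):
    """Interest plus pairwise tightness minus an assumed cost of 0.6 per person."""
    preference = sum(INTEREST[v] for v in group)
    preference += sum(TIGHTNESS.get(pair, 0.0) for pair in combinations(group, 2))
    return preference - 0.6 * len(group)


if __name__ == "__main__":
    print(enumerate_best_group(["a", "b", "c"], toy_utility, 3))
```

The number of candidate groups grows exponentially with n, which is why
enumeration is only practical at the network sizes used in the user study.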


4.3 Performance Comparison and Sensitivity Analysis 

Figure 3(a) compares the execution time of DGreedy, RGreedy, and BARGS by 
sampling different numbers of nodes from Facebook data. DGreedy is always the 
fastest one since it is a deterministic algorithm and generates only one final so¬ 
lution, whereas RGreedy requires more than 10® seconds. The results of RGreedy 
do not return within 2 days as n increases to 10000. To evaluate the performance
of BARGS with multi-threaded processing, Figure 3(b) shows that we can
accelerate the processing speed to 7.2 times with 8 threads. The acceleration
ratio is slightly lower than 8 because threads writing to the same memory
position must be synchronized under OpenMP and therefore cannot proceed at the
same time. Therefore, BARGS with parallelization is promising to be deployed as
a value-added cloud service.

Fig. 3. Experimental results on Facebook and DBLP datasets
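
As a small worked example, the reported speedup of 7.2 with 8 threads
corresponds to a parallel efficiency of 7.2 / 8 = 90%. The sketch below also
estimates the serial fraction that Amdahl's law would imply; Amdahl's law is a
standard model brought in here for illustration and is not used in the paper,
and the function names are hypothetical.

```python
def parallel_efficiency(speedup, threads):
    """Speedup per thread, e.g. 7.2 / 8 = 0.9."""
    return speedup / threads


def amdahl_serial_fraction(speedup, threads):
    """Solve S = 1 / (s + (1 - s) / p) for the serial fraction s.

    Rearranged: s = (p / S - 1) / (p - 1). Assumes Amdahl's law applies.
    """
    if threads <= 1:
        raise ValueError("need more than one thread")
    return (threads / speedup - 1.0) / (threads - 1.0)


if __name__ == "__main__":
    print(parallel_efficiency(7.2, 8))
    print(amdahl_serial_fraction(7.2, 8))
```

Under this model, only a small share of the work (under 2%) would need to be
serialized to explain the gap between 7.2 and the ideal speedup of 8.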

In addition to the running time, Figure 3(c) compares the solution quality
of different approaches. The results indicate that BARGS outperforms DGreedy
and RGreedy, especially under a large n. The group utility of BARGS is 45%
better than the one from DGreedy when n = 50000. On the other hand, RGreedy
outperforms DGreedy since it has a chance to jump out of a locally optimal
solution.

Figures 3(d) and (e) compare the execution time and solution quality of two 
randomized approaches under different total computational budgets, i.e., T. As 
T increases, the solution quality of BARGS increases faster than that of RGreedy 
because it can optimally allocate the computation resources. Even though the 
solution quality of RGreedy is closer to BARGS in some cases, BARGS is much 
faster than RGreedy by an order of 10“^. 

Figures 3(f) and (g) present the execution time and solution quality of 
RGreedy and BARGS with different numbers of start nodes, i.e., m. The results 
show that the solution quality in Figure 3(g) is almost the same as m increases,
demonstrating that it is sufficient for m to be set as a value smaller than
$n/k_{\max}$, as recommended by OCBA [3]. The running time of BARGS for m = 2 is
only 60% of the running time for m = 4 as shown in Figure 3(f), while the
solution quality remains almost the same.

BARGS is also evaluated on the DBLP dataset. Figures 3(h) and (i) show 
that BARGS outperforms DGreedy by 50% and RGreedy by 26% in solution 
quality when n = 500000. BARGS is still faster than RGreedy by an order of 
10“^. However, RGreedy runs faster on the DBLP dataset than on the Facebook 
dataset, because the DBLP dataset is a sparser graph with an average node 
degree of 3.66. Therefore, the number of candidate nodes to be chosen during the 
expansion of the partial solution in the DBLP dataset increases much more slowly 
than in the Facebook dataset with an average node degree of 26.1. Nevertheless, 
RGreedy is still not able to generate a solution for a large network size n due to
its poor efficiency.

5 Conclusion 

To the best of our knowledge, there is no real system or existing work in the
literature that addresses the issues of scale-adaptive group optimization for
social activity planning based on topic interest, social tightness, and activity
cost. To fill this research gap and satisfy an important practical need, this
paper formulated a new optimization problem called PSGA to derive a set of
attendees and maximize the group utility. We proved that PSGA is NP-hard and
devised a simple but effective randomized algorithm, namely BARGS, with a
guaranteed performance bound. The user study demonstrated that the social groups
obtained through the proposed algorithm implemented in Facebook significantly
outperform the solutions manually configured by users. This research result thus
holds much promise to be profitably adopted in social networking websites as a
value-added service.
Algorithm 1 BARGS

Input: Graph $G(V, E)$, social network size $n$, activity cost function $C(k)$,
  maximum group size $k_{\max}$, correct selection probability $P(CS)$, solution
  quality $Q$, percentile of CE $\rho$, and smoothing weight $w$
Output: The best group $F$ generating maximum willingness

$c_i = \infty$, $d_i = 0$ for all $i$;
... to candidate set $M$;
Select $m$ candidate nodes ...
$T_1 = m$ ...
Find the number of stages $r$ by first consulting the $N_b$ table with solution
  quality $Q$, and ... $+ 1\rfloor$;
for $t = 1$ to $r$ do
    if $t = 1$ then
        for $i = 1$ to $m$ do
            $A_i = T_1 / m$;
            Set the node selection probability vector $p_{i,t}$ as uniform;
    else
        $A_{total} = 0$;
        for $i = 1$ to $m$ do
            $A_i = \frac{1}{2} \left( \frac{d_i - c_b}{d_b - c_b} \right)^{N_b}$;
            $A_{total} = A_{total} + A_i$;
        $A_i = \lfloor T_t A_i / A_{total} \rfloor$;
    for $i = 1$ to $m$ do
        $V_S = M_i$;
        $V_A = \emptyset$;
        $X = \emptyset$;
        for $x = 1$ to $A_i$ do
            $V_A = N(M_i)$;
            for $k = 1$ to $k_{\max} - 1$ do
                Randomly select a node $v$ in $V_A$ in accordance with
                  $p_{i,k,t}$ and add it to $V_S$;
                $V_A = V_A \cup N(v)$;
                $u = U(V_S)$;
                X.add($V_S$, $u$);
                if $u > d_{i,k}$ then
                    $d_{i,k} = u$;
                if $u < c_{i,k}$ then
                    $c_{i,k} = u$;
                if $u > U(F)$ then
                    $b = i$;
                    $F = V_S$;
        {Update node selection probability $p_{i,k,t+1}$}
        X = DescendingSort(X, u);
        if $\gamma_t > X(\lceil \rho A_i \rceil).u$ then
            $\gamma_{t+1} = \gamma_t$;
        else
            $\gamma_{t+1} = X(\lceil \rho A_i \rceil).u$;
        for all samples $x$ in $X$ do
            if $x.u \geq \gamma_{t+1}$ then
                for all $v_j \in x$ do
                    $p_{i,k,t+1,j} = p_{i,k,t+1,j} + 1$;
        for $j = 1$ to $n$ do
            $p_{i,k,t+1,j} = p_{i,k,t+1,j} / \lceil \rho A_i \rceil$;
            $p_{i,k,t+1,j} = w\, p_{i,k,t+1,j} + (1 - w)\, p_{i,k,t,j}$;
Output $F$;
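
The probability update at the end of Algorithm 1 resembles a standard
cross-entropy step: rank the sampled groups, keep the top $\rho$ fraction as
elite samples, count how often each node appears among them, and smooth the
result with weight $w$. The sketch below illustrates that step for one start
node. It is an interpretation rather than the authors' implementation: the
function name and data layout are hypothetical, the dependence on the group
size $k$ is omitted, and elite samples are taken to be those with utility at
least the threshold.

```python
import math


def ce_update(samples, prev_prob, prev_gamma, rho, w):
    """One cross-entropy style update of node selection probabilities.

    samples:    list of (group_nodes, utility) pairs drawn in this stage
    prev_prob:  dict mapping node -> selection probability of the last stage
    prev_gamma: elite threshold of the last stage
    rho:        elite fraction (percentile), w: smoothing weight
    """
    if not samples:
        raise ValueError("at least one sample is required")
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    n_elite = max(1, math.ceil(rho * len(ranked)))
    # The threshold never decreases from one stage to the next.
    gamma = max(prev_gamma, ranked[n_elite - 1][1])

    counts = {node: 0 for node in prev_prob}
    for nodes, utility in ranked:
        if utility >= gamma:
            for node in nodes:
                counts[node] = counts.get(node, 0) + 1

    new_prob = {
        node: w * counts[node] / n_elite + (1.0 - w) * prev_prob[node]
        for node in prev_prob
    }
    return new_prob, gamma


if __name__ == "__main__":
    samples = [({"a", "b"}, 5.0), ({"a", "c"}, 3.0),
               ({"b", "c"}, 1.0), ({"c", "d"}, 0.5)]
    uniform = {node: 0.25 for node in "abcd"}
    print(ce_update(samples, uniform, float("-inf"), rho=0.5, w=0.5))
```

The values stored per node are inclusion frequencies among elite samples, so
they need not sum to one; a sampler would normalize them over the current
candidate set before drawing a node.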